_Name:_Jacob Hanimann

Question 0

Load all the libraries that you use in the rest of the analyses.

Answer:

library(tidyverse)
library(pheatmap)
library(viridis)
library(ggrepel)
library(DESeq2)

Question 1

Introduction

Antisense transcription is a current hot topic in genomics: when a certain genomic region is transcribed on both strands. In one model, the two RNAs produced may hybridize with each other since they are complementray, and then be degraded (a type of negative regulators). In another, deregulation happens becasue of RNA polymerase collisions.

We will now try to answer the question : How common is this in the human genome?

Use the refseq track from the hg19 assembly.

Using R and bedtools (and possibly a text editor to add headers to bed files if needed), find out:

  • 1: What fraction of refseq transcripts overlap another transcript on the opposite strand, covering at least 20% of the transcript?

Answer:

First, I downloaded two tables from the table browser in genome browser selecting: the hg19 genome, group = “gene and gene prediction”, track = “NCBI Refseq” and table = “RefSeq All”. By using the filter option, i created one table with only “+” strand transcripts and one with only “-” strand transcripts. Both tables were downloaded in the BED format with 12 columns (organised in the bedtool format).

Next, i used UNIX to to create a new table with an additional column displaying how many overlaps each “+” strand transcript has with “-” strand transcripts (covering at least 20% of the transcript) using the following command:

“betools intersect -a hg19_refseq_plus_txt -b hg19_refseq_minus.txt -c -f 0.2 > plus_minus_intersect_count.txt”

Here as a picture:

Note, that slightly different results would be calculated if you switched the order of “+” and “-” strand in the command line, since there is a threshold of 20%. In the following analysis, I will stick to the case of “+” strand as the reference because both ways show enough similar results when just looking at the fraction that overlap at least one opposite transcript.

First, I read in the data and deleted the columns i do not need and renamed the rows. Then, I calculated the ratio between transcript with overlaps 0< over all “+” strand transcripts through counting rows.

#read in downloaded data from UNIX
#intersect
plus_strand_intersect_count = as_tibble(read_tsv("plus_minus_intersect_count.txt", col_names = FALSE))

#plus strand (just needed for counting total transcript)
plus_strand = as_tibble(read_tsv("hg19_refseq_plus.txt", col_names = FALSE))

#minus strand (just needed for counting total transcript)
minus_strand = as_tibble(read_tsv("hg19_refseq_minus.txt", col_names = FALSE))

#deleting unneccessary columns:
plus_strand_intersect_count[5:12] <- list(NULL)

#naming columns
colnames(plus_strand_intersect_count) <- c("chrom","start","end","name","overlaps")

#calculate ratio between transcript with overlaps 0< over all "+" strand transcripts through counting rows
plus_strand_intersect_count %>% filter(overlaps>0) %>% summarise(n()) / summarise(plus_strand_intersect_count, n())
n()
0.0770338
#values of the calculation = 3050/39593 = 0.07703382    

Apparently, approximately 7.7% of all “+” strand transcript have at least one overlap with a “-” transcript, covering at least 20% of the transcript.

The fraction of all Refseq transcripts overlapping opposite strand transcripts (“+” strands with at least one overlap) / (total number of “+” and “-” transcripts) is:

plus_strand_intersect_count %>% filter(overlaps>0) %>% summarise(n()) / (nrow(plus_strand)+nrow(minus_strand))
n()
0.039198

3050 of 77810 transcripts, respectively about 3.9 %, have at least one overlap with the opposite strand (calculated with “-” aligned to “+” strands with a threshold of 20%)


  • 2: What is the distribution of of overlaps on opposite strand? E.g: plot an appropriate histogram how many refseq transcripts have 1,2,3…N opposite strand overlapping transcripts?

Answer:

Because you could get significantly different results depending of which strand you align to which, I calculated the opposite case of “-” strand as a reference followingly with the same pipeline as before:

UNIX command: “bedtools intersect -a hg19_refseq_minus.txt -b hg19_refseq_plus.txt -c -f 0.20 > minus_plus_intersect_count.txt”

I read the new data in and created one large tibble with both datasets combined and annotated each transcript with a column called strand with the value of either “plus” or “minus”.

#read in downloaded data from UNIX "-" as a reference
#intersect
minus_strand_intersect_count = as_tibble(read_tsv("minus_plus_intersect_count.txt", col_names = FALSE))

#deleting unneccessary columns:
minus_strand_intersect_count[5:12] <- list(NULL)

#annotating data with strand orientation
plus_strand_intersect_count = cbind(plus_strand_intersect_count, matrix(c("plus"), nrow = nrow(plus_strand_intersect_count), ncol=1))

minus_strand_intersect_count = cbind(minus_strand_intersect_count, matrix(c("minus"), nrow = nrow(minus_strand_intersect_count), ncol=1))

#naming columns
colnames(minus_strand_intersect_count) <- c("chrom","start","end","name","overlaps", "strand")
colnames(plus_strand_intersect_count) <- c("chrom","start","end","name","overlaps", "strand")

#binding datasets
total_intersect_count = rbind(plus_strand_intersect_count,minus_strand_intersect_count)

I plotted two histogram of overlap counts on the x-axis and counts of the transcript on the y-axis, starting at 1 overlap. I splitted the plots by strand annotation plus and minus (indicating alignment order).

#plotting a histogram of overlaps as the x-axis and counts on the y-axis starting at 1 overlap and a binwidth of 1
total_intersect_count %>% filter(overlaps >0) %>% ggplot(aes(x=overlaps, fill= strand)) + geom_histogram(binwidth = 1) + theme_bw() + facet_grid(~strand, scales = "free")

Looking at the Histogramm plots, it is apparent that most transcript have an overlap of one transcript of the opposite strand and the distribution looks like a exponential decay graph. There are also a few outliners from overlap count 20 to 41 (minus strand) to 81 (plus strand).


  • 3: What is the Refseq transcript with the largest number of overlapping transcripts on the other strand (same thresholds as above)? Show this in the genome browser, disccuss the image and suggest how we can improve the anlaysis based on this (max 200 words).

Answer:

I arranged the total_intersect_count tibble in a top-down manner with the criteria of overlaps.

total_intersect_count %>% arrange(desc(overlaps)) %>% head(5)
chrom start end name overlaps strand
chr18 34854422 34856363 NR_134588.1 81 plus
chrY 15591056 15591146 NR_162134.1 77 plus
chr8 53106921 53112112 NR_134310.1 54 plus
chr10 31650355 31688684 NM_001368169.1 41 minus
chr19 57323847 57325161 NR_023847.2 39 plus

Apparently, the transcript with the most overlapping transcript (strand=plus) with a count of 81 is labeled as NR_134588.1, is located on chromosome 18 and spans from position 34854422 to 34856363.

NR_134588.1 is annotated in RefSeq as the exons of the uncharacterized gene LOC105372068 which is classified as a non-coding RNA. On the opposite strand of this locus is the CELF4 gene, which is described as being part of the protein family that regulate pre-mRNA alternative splicing. Evidently, 81 transcript variants encoding different isoforms have been found for this gene.

It is apparent that the CELF4 gene is significantly longer than LOC105372068.

Based on this finding, it would make more sense to count a gene as one overlap and not each transcript of it as a individual count. How many isoforms of a gene exists, does not determine how much it will be transcripted and potentially interfere with the transcript of the opposite strand. Also, this approach to investigate antisense transcription is rather unspecific and vague. I would suggest to analyze tissue specific, since not all genes are expressed in all cell type. Furthermore, the timing of the transcription is also crucial to the potential of interfering with the antisense strand.


Question 2

Introduction

The RRP40 gene is part of the exosome complex, a molecular machine that degrades RNAs in the cells from the 3’ end. Your collaborator has just made a CAGE experiment in cells in which RRP40 was depleted using a RRP40-specific siRNA. For comparison,he/she also made a control experiment where a random siRNA was used. The hope is to be able to identify what RNAs that are degraded by RRP40, becasue we should observe higher levels of these if RRP40 is depleted. They have a hypothesis that there may be RNAs transcription initiation close to known gene transcription start sites (TSSs) that we never observe in normal cells becasue the RNAs are eaten up so fast that.

The CAGE reads are already mapped to the genome and another collaborator has already made some files for you that shows where they fall around annotated TSSs. Specifically, you are given 4 files where the rows are the -600 to +400 region around ~12.000 annotated TSSs, and the columns are the positions (so, the first is position -600 etc for respective TSS). Values are TPM-normalized CAGE counts plus a small amount of artificially added noise to avoid model overfitting, aside from the first column which is just the genomic position we are looking at, eg chr4:10000-11000+.

Because you have two experiments and two strands, you have four files in total. For example, the Hw4_CAGE_rrp40_senseStrand file has CAGE data on the plus strand and from the RRP40 depletion experiment. “Strand” is here always relative to the annoatated TSS, which is always defined to be on the plus strand.

These files are quite big (around 12 million data points, or 60-70 megabyte each), and your collaborators belatedly realized they could not plot these using Excel. Panic ensued. This is why they hired you: you know how to use R and your job is now to analyze the data and visualizing the results.

Specifically, they want you to:

Using tidyverse (except when reading in files, see below), make a plot where the Y axis is average fold change (rrp40/ctrl) and X axis is position relative to TSS (-600 to +400). Calculate fold changes for each strand so in the end, you will have a plot with two fold change ‘lines’, one for each strand. Interpret the results: What are we seeing and does this agree with the text-book decription of promoters and transcription start sites? ( max 100 words)

Answer:

After reading the data in, I created two matrices which calculated the fold change of the expression and then the mean of each position relative to the TSS. Then, I annotated the two matrices with the position and strand information (sense/antisense), renamed the column so that I could merge them. In the last step i plotted the data with a color code for thre strand feature.

#read in data
sense_control = as_tibble(read_tsv("Hw4_CAGE_Ctrl_senseStrand.txt"))
antisense_control = as_tibble(read_tsv("Hw4_CAGE_Ctrl_antisenseStrand.txt"))
sense_rrp40 = as_tibble(read_tsv("Hw4_CAGE_rrp40_senseStrand.txt"))
antisense_rrp40 = as_tibble(read_tsv("Hw4_CAGE_rrp40_antisenseStrand.txt"))

#create two new matrix with (rrp40/ctr) to calculate fold change and then the average of each position
average_fc_sense = as_tibble(colMeans((sense_rrp40[,-1]/sense_control[,-1])))
average_fc_antisense = as_tibble(colMeans((antisense_rrp40[,-1]/antisense_control[,-1])))

#create two matrixes for the data with position (1:1000) and strand information (sense\antisense) and then bind it to the average fold change value of the position
#sense
matrix_sense = cbind(matrix(c("sense"), nrow = 1000, ncol=1),matrix(c(1:1000),
nrow = 1000, ncol=1),average_fc_sense)
#antisense
matrix_antisense = cbind(matrix(c("antisense"), nrow = 1000, ncol=1), matrix(c(1:1000), nrow = 1000, ncol=1), average_fc_antisense)

#renaming the column names
colnames(matrix_sense) <- c("strand", "position", "foldchange")
colnames(matrix_antisense) <- c("strand", "position", "foldchange")

#merging both strand data sets to get one tibble
both_strands <- as_tibble(rbind(matrix_sense, matrix_antisense))

#plotting data
both_strands %>% ggplot(aes(x= position, y= foldchange, col= strand))+ geom_point(alpha=0.5) +theme_bw()

According to this plot, there was a notably bigger increase in transcription of antisense RNA compared to the sense RNA when the siRNA depleted the RRP40 gene. Sense RNA transcription was also upregulated, but not that aberrant. This data suggests that the target of the exosome complex component RRP40 is anti-sense RNA nearby and upstream of the TSS. Also, the exosome complex is described (https://www.uniprot.org/uniprot/Q08285) as participating in the elimination of RNA-processing by-products non-coding ‘pervasive’ transcripts, including antisense RNA-species. This annotation matches these observation of significantly more antisense-transcripts near TSS in the RRP40 mutant.

Question 3

Introduction

You have been hired by the Danish pharmaceutical giant Novo Nordisk to analyze an RNA-Seq study they have recently conducted. The study involves treatment of pancreatic islet cells with a new experimental drug for treatment of type 2 diabetes. Novo Nordisk wants to investigate how the drug affects cellular mRNA levels in general, and whether the expression of key groups of genes are affected.

As the patent for the new experimental drug is still pending, Novo Nordisk has censored the names of genes.

You have been supplied with 4 files:

  • studyDesign.tsv: File describing treatment of the 18 samples included in the study.
  • countMatrix.tsv: Number of RNA-Seq reads mapping to each of the genes.
  • normalizedMatrix.tsv: Normalized expression to each of the genes.
  • diabetesGene.tsv: Collection of genes known to be involved in type 2 diabetes.

Part 1: Exploratory Data Analysis

Question 3.1.1: Read all dataset into R, and make sure all three files have matching numbers and names of both samples and genes.

Answer:

count_matrix = read.table(file = 'countMatrix.tsv', sep = '\t')
diabetes_genes = read_tsv("diabetesGenes.tsv")
normalizedMatrix = read.table(file = 'normalizedMatrix.tsv', sep = '\t')
studyDesign = read_tsv("studyDesign.tsv")

Next, we want to see if the data makes sense, by making a heat map and a PCA plot.

Question 3.1.2: Heat map: For heat maps, it makes no sense to include all genes - instead, we will only look at genes that vary substantially across the samples. Specifically, select the genes top 10% of genes based on their variance across all samples, and make a heat map of those using the pheatmap library (standard settings). Rows in the heat mpa should be genes, columns shoudl be samples. Make an annotation row that shows whether each sample is treatment or control. Comment on your plot

Answer:

First, I calculated the variance score for each gene and added it as a column to the normalizedMatrix. Then i created a annotation matrix to later use in the heatmap to differ between treated and control samples. Lastly, i selected the top 10% of the highest variation score and plotted it as a heatmap.

#calculating the variance of each row
variance = normalizedMatrix  %>% apply(., MARGIN=1, FUN=var) %>% as_tibble

#adding the variance for each gene to the normalizedMatrix
nmatrix_var = cbind(normalizedMatrix,variance)

#annotation sample or treatment for heatmap
col<-data.frame(studyDesign[,-1])
row.names(col)<- studyDesign$Sample

#selecting the top 10% genes with the highest variance and plotting it with heatmap
nmatrix_var %>% arrange(desc(value))%>% top_n(.,(round(0.1*nrow(.)))) %>% select(-value)%>% pheatmap(.,color = magma(10),annotation_col =col)

The clustering of the expression patter divides the samples in two groups which matches the criteria of control and treated. There are two exception to this pattern which are Sample 6 (control) and Sample 18 (treated). It looks like the annotation of these samples were switched.

Question 3.1.3: PCA: Using the normalized matrix (all genes, not the top 10% of genes as in the heat map), perform a Principal Components Analysis (PCA) on the samples and produce a PCA-plot of the two first components, where the axis labels show the amount of variance explained by each component and samples are colored by their experimental group. Find a way to label the samples, so the identity (the sample name) of each point can easily be seen (hint: look at geom_text() or the ggrepel package!). Note, you should center but not scale the data. Comment on your plot

Answer:

I performed a pca by transposing the matrix so that the genes function as dimensions.

#perform pca on normallizedMatrix and looking at summary
pca_exp = normalizedMatrix %>% t() %>% prcomp(.,center=TRUE, )

#extracting variance
percent_variance_exp <- summary(pca_exp)$importance["Proportion of Variance",] * 100

#annotate data with sample identity and then plot
as_tibble(pca_exp$x) %>% bind_cols(studyDesign) %>%
ggplot(.,aes(y= PC2, x=PC1, col=Condition, label= Sample))+ ggrepel::geom_text_repel() + geom_point()+ theme_bw() + xlab(label = paste("PC1", percent_variance_exp[1])) + ylab(label = paste("PC2", percent_variance_exp[2]))

The PCA shows like the heatmap a distinct clustering between control and treated samples. Again, Sample 6 and Sample 18 are located in the opposite group, which support the hypothesis of these two samples being wrongly annotated (switched).

I did not observe any grouping in the PC2/PC3 plot which I could make sense of.

Question 3.1.4: Based on the two previous questions, discuss (max 50 words) whether your observations indicate that there are any problems with the data - e.g. outliers, mix ups, sub-groups. If you identified problems try to fix them (e.g. remove clear outliers if you find them, fix mix-ups, etc ). If you make a correction, make a PCA plot with your corrected data to check that the correction is doing the right thing

Answer:

Like metioned twice before, I suggest that the classification of Sample 6 (Ctrl) and Sample 18 (Trt) were mixed up and that they are in fact part of the other group.

I corrected the annotation by creating corrected version of the studyDesign data frame:

#corr stands for corrected
corr_studyDesign = studyDesign
corr_studyDesign[6,2] = "Trt"
corr_studyDesign[18,2] = "Ctrl"

#plotting PCA with corr_studyDesign as annotation_matrix
as_tibble(pca_exp$x) %>% bind_cols(corr_studyDesign) %>%
ggplot(.,aes(y= PC2, x=PC1, col=Condition, label= Sample))+ ggrepel::geom_text_repel() +geom_point()+ theme_bw() + xlab(label = paste("PC1", percent_variance_exp[1])) + ylab(label = paste("PC2", percent_variance_exp[2]))

With the new annotation the samples are smoothly seperated by their classification sample/control.

Part 2: Differential Expression (DE)

Question 3.2.1: Use DESeq2 to obtain differentially expressed (DE) genes between the two experimental conditions. Use default parameter, except use a logFC threshold of 0.25 and an adjusted P-value threshold of 0.05. How many up-and down regulated genes are there on your corrected data compared to if you do the Deseq2 analysis on un-corrected data?

Answer:

I conducted an analysis with corrected (corr) and uncorrected data (uncorr). First, I saved the data as DESeqDatSet-object. Then, I ran the DESeq analysis and displayed the results with the wanted parameters with the result function and subsequently with the summary function of the library.

#save data as DESeqDatSet-object
#uncorrected
dds_uncorr <- DESeqDataSetFromMatrix(countData = count_matrix,
  colData = studyDesign,
  design = ~ Condition) 

#corrected
dds_corr <- DESeqDataSetFromMatrix(countData = count_matrix,
  colData = corr_studyDesign,
  design = ~ Condition) 

#run DESeq
dds_uncorr <- DESeq(dds_uncorr)
dds_corr <- DESeq(dds_corr)

#see results with adjusted parameters: contrast states in which order the DE should be calculated (Treatment]\Control)
#uncorrected
res_uncorr <- results(dds_uncorr,
contrast=c("Condition", "Trt", "Ctrl"),
lfcThreshold=0.25, # logFC cutoff, instead of 0
alpha=0.05) # adjusted FDR cutoff

#corrected
res_corr <- results(dds_corr,
contrast=c("Condition", "Trt", "Ctrl"),
lfcThreshold=0.25, # logFC cutoff, instead of 0
alpha=0.05) # adjusted FDR cutoff

#Summary of uncorrected Data set
summary(res_uncorr)
## 
## out of 5669 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0.25 (up)    : 160, 2.8%
## LFC < -0.25 (down) : 194, 3.4%
## outliers [1]       : 0, 0%
## low counts [2]     : 550, 9.7%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#Summary of corrected Data set
summary(res_corr)
## 
## out of 5669 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0.25 (up)    : 653, 12%
## LFC < -0.25 (down) : 873, 15%
## outliers [1]       : 0, 0%
## low counts [2]     : 110, 1.9%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

In the uncorrected data the analyses states 160 (2.8%) upregulated and 194 (3.4%) downregulated genes, whereas the corrected data results in 653 (12%) upregulateda and 873 (15%) downregulated genes. This is significantly more and crucial for further analysis.

Question 3.2.2: From now on, we will only analyze the corrected data. Convert the output of DESeq2 (corrected data) to a tibble, and make an MA-plot using ggplot2. The MA-plot should show the overall trend using a trend line and genes should colored according to their DE status. Discuss whether the MA-plot indicates an approriate DESeq2 analysis (max 70 words discussion).

Answer:

#transfrom the DESeq data format to a tibble
res_corr_tibble <-res_corr %>%
as.data.frame %>% 
rownames_to_column("Gene") %>% 
as_tibble

#MA-plot from tibble
res_corr_tibble %>% ggplot(., aes(x=baseMean, y=log2FoldChange, col = ifelse(padj > 0.05 & padj==NA_character_,"",'significant'))) + 
  geom_point(alpha=0.5) + geom_smooth(col="blue") + 
  scale_x_log10() + 
  geom_hline(yintercept = 0, alpha = 0.75,
  color="red")+  theme_bw() + theme(legend.position = "none")

The MA-plot indicates that proper DE analysis can be conducted, since it seems like the normalization was successfull because the gene distribution is around log2F = 0, which is the assumption that is made. Also, there is a decently amount of significant foldchanges and not too many. There is an outlier up-right in the plot which contributes to the slope of the trend line.

Question 3.2.3: Sort the DE statistics table that you get from DESeq2 to report the top 10 genes sorted by

a) positive logFC (highest on top)

#arrange the tibble top-down, log2FoldChange as a criteria
res_corr_tibble %>% arrange(desc(log2FoldChange)) %>% head(10)
Gene baseMean log2FoldChange lfcSE stat pvalue padj
Gene861 32093.778217 5.832809 0.4546667 12.278903 0.0000000 0.0000000
Gene4975 8.280601 5.598001 1.4776725 3.619206 0.0002955 0.0023170
Gene4582 1104.622721 5.520796 0.4297880 12.263712 0.0000000 0.0000000
Gene5510 11.872892 5.458462 1.3509775 3.855329 0.0001156 0.0010823
Gene541 24.100684 5.127973 0.7414828 6.578672 0.0000000 0.0000000
Gene2038 65.191615 5.119450 0.9021305 5.397722 0.0000001 0.0000022
Gene1950 3.141275 4.844799 1.1362548 4.043810 0.0000526 0.0005622
Gene4259 18.900008 4.816041 1.2009566 3.802003 0.0001435 0.0012828
Gene5215 467.146864 4.573887 0.4473346 9.665890 0.0000000 0.0000000
Gene1910 869.338646 4.548284 0.4964889 8.657362 0.0000000 0.0000000

b) negative logFC (lowest on top)

#arrange the tibble down to top, log2FoldChange as a criteria
res_corr_tibble %>% arrange(log2FoldChange) %>% head(10)
Gene baseMean log2FoldChange lfcSE stat pvalue padj
Gene2502 28.830140 -6.018185 0.6684912 -8.628662 0.00e+00 0.0000000
Gene5604 5.274646 -5.504877 0.9705110 -5.414547 1.00e-07 0.0000021
Gene5099 6.342716 -5.449355 0.8730538 -5.955365 0.00e+00 0.0000001
Gene1613 7.846785 -5.431049 0.9441293 -5.487648 0.00e+00 0.0000015
Gene3160 4.893259 -5.392966 0.9065088 -5.673376 0.00e+00 0.0000006
Gene3593 7.044358 -5.271785 1.0319123 -4.866485 1.10e-06 0.0000238
Gene77 3.892940 -5.056671 1.0754914 -4.469279 7.80e-06 0.0001232
Gene4597 3.065637 -5.042683 1.0635378 -4.506359 6.60e-06 0.0001063
Gene1056 98.450449 -4.981815 0.4486604 -10.546540 0.00e+00 0.0000000
Gene5282 3.681568 -4.973660 1.1100428 -4.255385 2.09e-05 0.0002643

only looking at significantly differentially expressed genes

#arrange the tibble top-down, padj as a criteria
res_corr_tibble %>% arrange(padj) %>% head(10)
Gene baseMean log2FoldChange lfcSE stat pvalue padj
Gene861 32093.77822 5.832809 0.4546667 12.278903 0 0
Gene4582 1104.62272 5.520796 0.4297880 12.263712 0 0
Gene1056 98.45045 -4.981815 0.4486604 -10.546540 0 0
Gene5215 467.14686 4.573887 0.4473346 9.665890 0 0
Gene2676 78.59401 4.353638 0.4328948 9.479528 0 0
Gene1684 116.26014 4.510864 0.4586087 9.290849 0 0
Gene423 510.03984 4.161737 0.4305440 9.085569 0 0
Gene3031 145.59970 -3.639055 0.3866102 -8.766079 0 0
Gene3126 37.24829 -4.725719 0.5141028 -8.705883 0 0
Gene1910 869.33865 4.548284 0.4964889 8.657362 0 0

Part 3: Is the drug any good?

Question 3.3.1: Novo Nordisk claims their treatment affects expression of genes related to diabetes. Your task is to investigate whether this is true. They have supplied you with a long list of genes that are diabetes-related - diabetesGenes.tsv. Are these genes more up/down regulated than expected by chance, by looking at log2FC values from above ?

Answer:

To visualize the log2FC, p-value and adjusted p-value of these diabetes-related genes i made a volcano-plot.

#filter the diabetes_genes out of the result tibble from the DESeq analysis and plotting the volcano plot
res_corr_tibble %>% filter(Gene %in% diabetes_genes$Gene) %>% ggplot(., aes(x=log2FoldChange, 
    y=-log10(pvalue), color=padj < 0.05)) + 
  geom_point(alpha=0.5) +
  geom_vline(xintercept = 0, alpha = 0.75, 
  linetype="dashed")+theme_bw()

It is visible that there are genes which are significantly up/down regulated in regards to the multiple testing adjusted p-value (padj) being < 0.05. Next, I investigated if this treatment affects specifically diabetes-related genes or affect non-diabetes-related genes the same way.